Question 1

a.)

True. If the response variable has sample correlation 0 with every predictor variable, then all of the estimated slopes are 0 and the fitted model reduces to the intercept alone. In the simple linear regression case, this is a horizontal line through the data at \(\bar{Y}\).
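As a quick illustration (a toy sketch with made-up data, not part of the assignment): when the sample correlation between predictor and response is exactly 0, the least-squares slope is exactly 0 and the fit is the horizontal line at \(\bar{Y}\).

```r
# Toy data chosen so that cor(x, y) == 0 by symmetry
x <- c(-1, 0, 1)
y <- c(1, 0, 1)
fit <- lm(y ~ x)
coef(fit)  # slope is 0; intercept equals mean(y)
```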

b.)

True. Even if predictor variables are perfectly correlated, the model can still fit the data well. Thinking geometrically, under perfect multicollinearity the column space of \(X\) has dimension less than \(p\); but since that space still exists, the projection of \(Y\) onto it, and hence the fit, is still well-defined (even though the individual coefficients are not unique).
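A minimal sketch of this with simulated data (not the property data): R flags the rank deficiency with an NA coefficient, but the fitted values are still well-defined.

```r
set.seed(1)
x1 <- rnorm(20)
x2 <- 2 * x1              # perfectly correlated with x1: dim of column space < p
y  <- 1 + x1 + rnorm(20)
fit <- lm(y ~ x1 + x2)
coef(fit)                 # x2 coefficient is NA (rank-deficient design)
head(fitted(fit))         # the projection of y still exists
```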

c.)

True. Linear transformations of the variables leave the ANOVA sums of squares unchanged, and thus do not change the coefficient of multiple determination.
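A self-contained check (using the built-in mtcars data rather than the assignment data): rescaling and shifting a predictor leaves \(R^2\) unchanged.

```r
fit1 <- lm(mpg ~ wt, data = mtcars)
wt2  <- 1000 * mtcars$wt + 5      # linear transformation of the predictor
fit2 <- lm(mpg ~ wt2, data = mtcars)
c(summary(fit1)$r.squared, summary(fit2)$r.squared)  # identical up to rounding
```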

d.)

True. The \(p-1\) individual t-tests are not equivalent to testing whether there is a regression relation between \(Y\) and the set of \(X\) variables (as tested by the overall F test). Under multicollinearity, it can happen that none of the t-tests rejects even though the F test does.
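A simulated sketch of this phenomenon (made-up data, seed chosen arbitrarily): with two nearly collinear predictors, the overall F test is highly significant even though the individual t-tests can each fail to reject.

```r
set.seed(2)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.01)   # nearly perfectly correlated with x1
y  <- x1 + x2 + rnorm(50)
fit <- summary(lm(y ~ x1 + x2))
fit$coefficients[, "Pr(>|t|)"]    # individual t-tests may be non-significant
pf(fit$fstatistic[1], fit$fstatistic[2], fit$fstatistic[3],
   lower.tail = FALSE)            # overall F test is highly significant
```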

e.)

True. We can have a large number of variables, uncorrelated with each other, that are individually significant but yield a non-significant p-value as a whole.

f.)

library(ggplot2)
library(GGally)
library(plotly)

# Read in the rental data and attach descriptive column names
property <- read.table("property.txt")
colnames(property) <-
  c("Ren.Rate", "Age", "Exp", "Vac.Rate", "Sq.Foot")
head(property)

Question 8

a.)

ggplotly(ggplot(data = property, aes(x = Age, y = Ren.Rate)) + geom_point())

We can see from the plot that there is no clear linear relationship between the age of a property and its rental rate.

b.)

We have the model equation: \[ Y_i = \beta_0 + \beta_1\tilde{X}_{i1} + \beta_{11}\tilde{X}_{i1}^2 + \beta_2X_{i2} + \beta_4X_{i4} + \varepsilon_i \] where \(\tilde{X}_{i1} = X_{i1} - \bar{X}_1\) is the centered age. Note: the uncentered \(X_{i1}\) can be fit in place of \(\tilde{X}_{i1}\).

# Center age (reduces correlation between the linear and quadratic terms)
property["AgeCent"] <- property$Age - mean(property$Age)
property["AgeSq"] <- property$AgeCent ^ 2

polyModel <-
  lm(Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
summary(polyModel)
## 
## Call:
## lm(formula = Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89596 -0.62547 -0.08907  0.62793  2.68309 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.019e+01  6.709e-01  15.188  < 2e-16 ***
## AgeCent     -1.818e-01  2.551e-02  -7.125 5.10e-10 ***
## AgeSq        1.415e-02  5.821e-03   2.431   0.0174 *  
## Exp          3.140e-01  5.880e-02   5.340 9.33e-07 ***
## Sq.Foot      8.046e-06  1.267e-06   6.351 1.42e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.097 on 76 degrees of freedom
## Multiple R-squared:  0.6131, Adjusted R-squared:  0.5927 
## F-statistic:  30.1 on 4 and 76 DF,  p-value: 5.203e-15
# Plotting Observations Against Fitted Values
ggplotly(
  ggplot() + aes(x = polyModel$fitted.values, y = property$Ren.Rate) +
    geom_point() +
    labs(x = "Fitted Values", y = "Observations",
         title = "Observations against Fitted Values")
)

We have the estimated regression function: \[ \hat{Y}_i = 10.19 - 0.182\tilde{X}_{i1} + 0.0142\tilde{X}_{i1}^2 + 0.314X_{i2} + 0.0000080X_{i4} \] We find that the model is a good fit: it has a reasonably high \(R^2_{adj}\) of 0.5927, and the plot of observations against fitted values is fairly linear.
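As a side check of the centering step (self-contained, using mtcars rather than the property data): centering a predictor before adding its square changes the coefficients but not the fitted values, so the fit itself is unaffected.

```r
m1 <- lm(mpg ~ wt + I(wt^2), data = mtcars)
wc <- mtcars$wt - mean(mtcars$wt)           # centered predictor
m2 <- lm(mpg ~ wc + I(wc^2), data = mtcars)
all.equal(fitted(m1), fitted(m2))           # same fitted values either way
```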

c.)

# Model 2
model2 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
summary(model2)
## 
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0620 -0.6437 -0.1013  0.5672  2.9583 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.237e+01  4.928e-01  25.100  < 2e-16 ***
## Age         -1.442e-01  2.092e-02  -6.891 1.33e-09 ***
## Exp          2.672e-01  5.729e-02   4.663 1.29e-05 ***
## Sq.Foot      8.178e-06  1.305e-06   6.265 1.97e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.132 on 77 degrees of freedom
## Multiple R-squared:  0.583,  Adjusted R-squared:  0.5667 
## F-statistic: 35.88 on 3 and 77 DF,  p-value: 1.295e-14

We find that both \(R^2\) and \(R^2_{adj}\) are higher for the quadratic model than for Model 2: \(R^2\) is 0.6131 versus 0.583, and \(R^2_{adj}\) is 0.5927 versus 0.5667. Since \(R^2\) necessarily increases whenever a predictor is added, the \(R^2_{adj}\) comparison is the more meaningful one, and it also favors the quadratic model. This leads us to conclude that the quadratic model is a better fit than Model 2.

d.)

To test our full model versus our reduced model, we have: \[ H_0: \beta_j = 0\ \text{for all} \ j\in \mathbf J \qquad H_a: \beta_j \neq 0 \ \text{for at least one} \ j\in \mathbf J \] With test statistic and null distribution: \[ F^* = \frac{\frac{SSE(R)-SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}, \qquad F^* \sim F_{(df_R - df_F,\, df_F)} \ \text{under} \ H_0 \] Here the full model is the quadratic model, the reduced model is Model 2, and \(\mathbf J\) contains only the index of the quadratic term.

We reject \(H_0\) if \(F^* > F_{(1- \alpha;\, df_R - df_F,\, df_F)}\).

# Find the critical value F(1 - 0.05; 1, 76); df_F = 76 from the full model
qf(1 - 0.05, 77 - 76, 76)
anova(model2, polyModel)

Given that \(F^* = 5.9078\) exceeds the critical value, we reject \(H_0\) and conclude that the quadratic term is significant in the model at \(\alpha = 0.05\).
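As a self-contained sanity check of the general linear test mechanics (mtcars again, since property.txt is not reproduced here): computing \(F^*\) from the two SSEs matches what anova() reports, and for a single added term \(F^*\) equals the square of that term's t statistic, consistent with \(2.431^2 \approx 5.91\) above.

```r
full <- lm(mpg ~ wt + hp, data = mtcars)
red  <- lm(mpg ~ wt, data = mtcars)
sseF <- sum(resid(full)^2); dfF <- df.residual(full)
sseR <- sum(resid(red)^2);  dfR <- df.residual(red)
Fstar <- ((sseR - sseF) / (dfR - dfF)) / (sseF / dfF)
c(Fstar, anova(red, full)$F[2])            # the two agree
Fstar - summary(full)$coefficients["hp", "t value"]^2  # ~0 for one added term
```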

e.)

#Our prediction for model 2
predict(model2, data.frame(Age = 4, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 15.11985 12.09134 18.14836
# Our prediction for the quadratic model
# (Note: AgeCent = 4 is plugged in directly; if the new property has Age = 4
# on the original scale, AgeCent would be 4 - mean(property$Age) instead.)
predict(polyModel, data.frame(AgeCent = 4, AgeSq = 16, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 13.47259 10.47873 16.46645

We can see that the quadratic model yields a lower-valued prediction interval than Model 2.